Quick introduction

Our goal in this project is to explore how several variables of interest relate to insurance charges. The dataset was taken from the book Machine Learning with R by Brett Lantz.

Link to the dataset: https://www.kaggle.com/mirichoi0218/insurance

Dataset content:

Data description: 1300+ entries of insurance charges with some other relevant data. A quick check shows that we have no missing values or duplicates, so there’s little data handling to be done.

The age column ranges from 18 to 64. To simplify things and make plotting easier, we’ll group the entries into age groups:

Similar grouping with BMI:
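The grouping step could look like the minimal sketch below. The breakpoints, labels and example rows are assumptions made for illustration; the report does not state the actual cut-offs. The only hints in the text are that the regression output later shows `age_groups.L` and `age_groups.Q` (consistent with an ordered factor with three levels) and that a BMI of 30 is treated as the obesity threshold.

```r
# Hypothetical example rows; the real data has columns age and bmi.
insurance <- data.frame(
  age = c(18, 25, 36, 47, 58, 64),
  bmi = c(17.9, 22.4, 27.1, 31.6, 35.2, 29.9)
)

# Assumed age breaks: three ordered groups covering 18-64.
insurance$age_groups <- cut(
  insurance$age,
  breaks = c(17, 35, 50, 64),
  labels = c("18-35", "36-50", "51-64"),
  ordered_result = TRUE
)

# Assumed BMI breaks: the common 18.5 / 25 / 30 cut-offs, left-closed intervals.
insurance$bmi_groups <- cut(
  insurance$bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "normal", "overweight", "obese"),
  right = FALSE,
  ordered_result = TRUE
)

print(insurance)
```

Using `ordered_result = TRUE` keeps the natural ordering of the groups, which matters for plot axes and for how R encodes the factor in a model.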

EDA and relevant observations

We’ll now look further into the relationships between the variables in the dataset. To summarize the findings we’ve gathered, we start with the correlation matrix of the numeric variables, which points to the relationships worth analyzing:

##             age    bmi children charges
## age      1.0000 0.1093   0.0425  0.2990
## bmi      0.1093 1.0000   0.0128  0.1983
## children 0.0425 0.0128   1.0000  0.0680
## charges  0.2990 0.1983   0.0680  1.0000
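A matrix like the one above can be produced with base R’s `cor()`, which computes Pearson correlations by default. The sketch below runs on synthetic stand-in data (the column names match the dataset, the values do not), so its numbers will differ from the ones reported.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the numeric columns (not the real data).
insurance <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE)
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  rnorm(n, sd = 8000)

# Pairwise Pearson correlations, rounded to 4 decimals like the output above.
num_cols <- c("age", "bmi", "children", "charges")
cor_mat <- round(cor(insurance[, num_cols]), 4)
print(cor_mat)
```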

Some other interesting observations we’ve found during this EDA:

  • BMI comparison between smokers and non-smokers: relatively similar, which is odd because smoking is known to increase metabolism and lead to weight loss. Then again, our dataset looks at the demographic that has to use its insurance, so there might be an underlying health factor at play.

  • The relationship might be masked by some confounding factors?

  • Charges comparison between regions: the southwest seems to be the cheapest region, while the northeast has the highest mean and the distribution of the southeast is skewed toward higher values. No general trend is recognizable when we plot region against the numerical variables.

  • BMI/Age/Insurance Charge: a closer look at the relationship between the variables we’re most concerned with. There seems to be a correlation between higher charges and obesity or old age, although the coefficients above (about 0.20 for BMI and 0.30 for age) are modest. We’ll look deeper into this later in our toy linear model.

  • Smoking vs medical fees: it’s not much of a surprise to see that smokers have much higher insurance charges than non-smokers.

Regression task: Can you accurately predict insurance costs?

This task was taken from the book. The objective might sound strange to most, but in the US, where people would rather drive themselves or call a taxi during an emergency than call an ambulance, it’s actually a very relevant problem. Because insurance plans vary wildly in what they cover, there is demand for services that estimate hospital bills.

Keep in mind that this dataset was simulated on the basis of demographic statistics from the US Census Bureau, according to the book it comes from. Still, it can serve as a basis for developing some insight into the US medical insurance system.

The relationship between BMI and age with insurance charge

We’ll take a closer look at the relationship between BMI, age and charges. These variables were selected because they are the main numeric variables in this dataset and describe the most variance in the data.

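After a quick look at the toy model we’ve built and the p-values of its coefficients, we can confirm there’s a correlation between the variables: BMI and the linear age term are highly significant, while the quadratic age term is not (p = 0.286). Looking at the residual standard error (11,480) and R-squared (about 0.10), it’s easy to see we were naive to think a linear model on these two variables would suffice. With an F-statistic of 51.45 (p < 2.2e-16) the model as a whole is significant, so we can assume a relationship here and move on to implementing other models.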

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16801  -7203  -4979   6614  47739 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13270.4      313.8  42.292  < 2e-16 ***
## bmi            2082.1      315.6   6.597 6.04e-11 ***
## age_groups.L   2971.2      329.1   9.029  < 2e-16 ***
## age_groups.Q   -349.8      327.4  -1.068    0.286    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11480 on 1334 degrees of freedom
## Multiple R-squared:  0.1037, Adjusted R-squared:  0.1017 
## F-statistic: 51.45 on 3 and 1334 DF,  p-value: < 2.2e-16
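The `age_groups.L` and `age_groups.Q` rows in the output above are what R produces when an ordered factor with three levels enters a linear model: ordered factors get polynomial contrasts by default, giving one linear and one quadratic term. The sketch below reproduces that structure with plain `lm()` on synthetic data. The data, the age breaks and the level labels are assumptions; the `.outcome ~ .` formula in the call above suggests the original fit went through a wrapper such as caret’s `train()`, which is not used here.

```r
set.seed(42)
n <- 300

# Synthetic stand-in data (not the real dataset).
dat <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6)
)
dat$charges <- 3000 + 240 * dat$age + 330 * dat$bmi + rnorm(n, sd = 9000)

# Assumed breaks; an ordered factor gets polynomial contrasts by default.
dat$age_groups <- cut(dat$age, breaks = c(17, 35, 50, 64),
                      labels = c("young", "middle", "senior"),
                      ordered_result = TRUE)

fit <- lm(charges ~ bmi + age_groups, data = dat)
print(summary(fit))

# With 3 levels there is a linear (.L) and a quadratic (.Q) contrast.
print(names(coef(fit)))
```

A non-significant `.Q` term, as in the report, would mean there is no evidence of curvature across the three age groups beyond the linear trend.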

The linear regression model seems to have captured the correlation between age, BMI and charges, even though it explains only about 10% of the variance. This further supports the idea of a relationship between them.

Comparing a few regression models’ performance in this dataset

The dataset was split, with stratification, into 80% train and 20% test. All the models were evaluated with 10-fold cross-validation. Below are the results of the models on the test set.

We can see that the overall winner with minimal tuning is random forest (RMSE of about 4,522, R-squared of about 0.86), followed by KNN (RMSE of about 5,377). On this specific dataset pcr performed the worst by a wide margin (RMSE of about 11,271, R-squared of about 0.14), which might be due to some feature of the dataset we’re not yet aware of. What’s odd is that the dataset has a particular structure that we expected random forest to catch, but its accuracy was lower than expected.

## [1] "lmStepAIC"
##         RMSE     Rsquared          MAE 
## 5789.4995408    0.7752191 4098.9541191 
## [1] "pcr"
##         RMSE     Rsquared          MAE 
## 1.127058e+04 1.448556e-01 8.738885e+03 
## [1] "pls"
##         RMSE     Rsquared          MAE 
## 5770.8918286    0.7766102 4078.9038450 
## [1] "ridge"
##         RMSE     Rsquared          MAE 
## 5764.8624538    0.7770745 4072.4283348 
## [1] "lasso"
##         RMSE     Rsquared          MAE 
## 5808.5388723    0.7754018 4107.0449266 
## [1] "enet"
##         RMSE     Rsquared          MAE 
## 5764.8624538    0.7770745 4072.4283348 
## [1] "knn"
##         RMSE     Rsquared          MAE 
## 5376.7569808    0.8050967 3216.6274003 
## [1] "gam"
##         RMSE     Rsquared          MAE 
## 5744.1237601    0.7781097 4102.8114564 
## [1] "rf"
##         RMSE     Rsquared          MAE 
## 4522.2698037    0.8634668 2527.1010350
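For readers who want to see the evaluation procedure spelled out, here is a minimal base R sketch of a stratified 80/20 split, 10-fold cross-validation on the training part, and the three reported metrics. It is not the code behind the numbers above: the data is synthetic, a plain `lm()` stands in for the models, and stratifying on quintiles of the outcome is an assumption. The metric names match what caret’s `postResample()` reports, where `Rsquared` is the squared Pearson correlation between predictions and observations, and that definition is used here.

```r
set.seed(123)
n <- 500

# Synthetic stand-in data (not the real dataset).
dat <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6)
)
dat$charges <- 3000 + 240 * dat$age + 330 * dat$bmi + rnorm(n, sd = 5000)

# Stratified 80/20 split: sample 80% inside each quintile of the outcome.
strata <- cut(dat$charges,
              breaks = quantile(dat$charges, probs = seq(0, 1, 0.2)),
              include.lowest = TRUE, labels = FALSE)
train_idx <- unlist(lapply(split(seq_len(n), strata), function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}), use.names = FALSE)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# The three reported metrics, written out by hand.
regression_metrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,  # squared Pearson correlation
    MAE = mean(abs(pred - obs)))
}

# 10-fold cross-validation on the training set only.
k <- 10
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(charges ~ age + bmi, data = train[fold != i, ])
  pred <- predict(fit, newdata = train[fold == i, ])
  regression_metrics(pred, train$charges[fold == i])[["RMSE"]]
})
cat("Mean CV RMSE:", mean(cv_rmse), "\n")

# Final fit on the full training set, scored once on the held-out test set.
final_fit <- lm(charges ~ age + bmi, data = train)
test_metrics <- regression_metrics(predict(final_fit, newdata = test),
                                   test$charges)
print(test_metrics)
```

Keeping the test set out of the cross-validation loop is the point of the design: the folds are used for model selection, and the test set is touched only once for the final comparison.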

Clustering task: Predicting the clusters with higher insurance charges

The direct implication is that it helps us make better lifestyle choices to lower our insurance charges. With this information, researchers could also understand which demographics are most affected by these medical charges, and use that as a basis for solving bigger problems.

K-means model

The elbow method showed us that the number of clusters should be 3 or 4. Since children behaves more like a categorical variable, we continue with the other 3 numeric variables.
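The elbow method can be sketched as follows: run K-means for a range of k and record the total within-cluster sum of squares, then look for the k after which the decrease flattens out. The data below is a synthetic stand-in with three loose groups, and standardizing the columns with `scale()` is an assumption (the report does not say whether the variables were scaled).

```r
set.seed(7)
n <- 150

# Synthetic stand-in with three loose groups (not the real data).
x <- data.frame(
  age = c(rnorm(n, 30, 6), rnorm(n, 50, 6), rnorm(n, 40, 8)),
  bmi = c(rnorm(n, 27, 3), rnorm(n, 29, 3), rnorm(n, 35, 3)),
  charges = c(rnorm(n, 5000, 1500), rnorm(n, 12000, 2000),
              rnorm(n, 35000, 4000))
)

# Standardize so that charges do not dominate the Euclidean distance.
x_scaled <- scale(x)

# Total within-cluster sum of squares for k = 1..8.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(x_scaled, centers = k, nstart = 20)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting `wss` against `ks` (for example with `plot(ks, wss, type = "b")`) gives the usual elbow curve.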

The K-means model gave us a very insightful look at the clusters. The clean visual separation might be due to the fact that the dataset was simulated. What’s interesting is that when we increase the number of clusters from 3 to 4, another cluster appears, split off by age from the lower insurance charge cluster.

Hierarchical models

A quick glance tells us we should skip the single linkage method, because on the right side of its dendrogram we see a very intertwined branch. This would make the results harder to interpret and would not give us clear clusters.

From the result of the elbow test earlier, we’ve decided that this dataset will have 3-4 clusters. We’ll now cut the branches accordingly.
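Fitting the trees and cutting them could look like the sketch below, which builds single, average and complete linkage trees with `hclust()` and cuts each into 3 and 4 clusters with `cutree()`. The data is a synthetic stand-in, and the Euclidean distance on standardized columns is an assumption.

```r
set.seed(11)
n <- 60

# Synthetic stand-in with three loose groups (not the real data).
x <- data.frame(
  age = c(rnorm(n, 30, 6), rnorm(n, 50, 6), rnorm(n, 40, 8)),
  bmi = c(rnorm(n, 27, 3), rnorm(n, 29, 3), rnorm(n, 35, 3)),
  charges = c(rnorm(n, 5000, 1500), rnorm(n, 12000, 2000),
              rnorm(n, 35000, 4000))
)
d <- dist(scale(x))  # Euclidean distance on standardized columns (assumed)

methods <- c("single", "average", "complete")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Cut each tree into 3 and 4 clusters and show the cluster sizes.
for (m in methods) {
  for (k in 3:4) {
    cat(m, "k =", k, ":", table(cutree(trees[[m]], k = k)), "\n")
  }
}
```

Very unbalanced cluster sizes, such as one large cluster plus a few singletons, are a typical sign of the chaining behaviour that makes single linkage hard to interpret.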

Visualizing the clusters

The average linkage model showed some promise, but only managed to capture 2 clusters when we look at the scatterplot. Meanwhile, the complete linkage model seems to divide the upper data points using a combination of age and BMI.

Interestingly, the hierarchical model for 4 clusters only captured some outliers in its 4th cluster. The correlation-based approach also produced a poor result, with 1 cluster holding the majority of the points. We won’t be going any further with these.
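For reference, a correlation-based distance is usually built as one minus the correlation between observations (rows), not between variables. The sketch below shows one common way to do that on synthetic data; the linkage method and the scaling step are assumptions, since the report does not say how its version was set up.

```r
set.seed(3)
n <- 40

# Synthetic stand-in data (not the real dataset).
x <- data.frame(
  age = rnorm(n, 40, 12),
  bmi = rnorm(n, 30, 6),
  charges = rnorm(n, 13000, 9000)
)
x_scaled <- scale(x)

# Correlation between observations (rows), hence the transpose.
row_cor <- cor(t(x_scaled))
d_cor <- as.dist(1 - row_cor)  # 0 = identical profile, 2 = opposite profile

tree_cor <- hclust(d_cor, method = "complete")
print(table(cutree(tree_cor, k = 3)))
```

With only three features per row, the row-wise correlation is a coarse measure of similarity, which may be one reason this approach struggles on a dataset like this one.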

What’s interesting is that once we’ve seen the plots of the main numerical variables against smoker, we start to get a grasp of the clustering results above.

Most if not all of the lower insurance charges belong to non-smokers. The middle layer of insurance charges is vaguely divided at a BMI of 30 (the threshold for obesity). And the highest layer of insurance charges consists entirely of smokers. We get a clearer insight into the dataset’s trend by organizing the clusters with insurance charge in mind:

  • Obese smokers > Smokers > Obese non-smokers > non-smokers
  • Insurance charge has a positive correlation with age

What’s more interesting is that the best model for 4 clusters split off the 4th one by age rather than BMI. This tells us that even if the workings of an algorithm on a large dataset are a black box, human insight is still vital.

Conclusion

The simplest advice we can give you, which you might already know, is to quit smoking and take care of your health.